In this practical we will:

Areal (lattice) data

Areal data our measurements are summarised across a set of discrete, non-overlapping spatial units such as postcode areas, health board or pixels on a satellite image. In consequence, the spatial domain is a countable collection of (regular or irregular) areal units at which variables are observed. Many public health studies use data aggregated over groups rather than data on individuals - often this is for privacy reasons, but it may also be for convenience.

In the next example we are going to explore data on respiratory hospitalisations for Greater Glasgow and Clyde between 2007 and 2011. The data are available from the CARBayesdata R Package:

library(CARBayesdata)

data(pollutionhealthdata)
data(GGHB.IZ)

The pollutionhealthdata contains the spatiotemporal data on respiratory hospitalisations, air pollution concentrations and socio-economic deprivation covariates for the 271 Intermediate Zones (IZ) that make up the Greater Glasgow and Clyde health board in Scotland. Data are provided by the Scottish Government and the available variables are:

  • IZ: unique identifier for each IZ.
  • year: the year were the measruments were taken
  • observed: observed numbers of hospitalisations due to respiratory disease.
  • expected: expected numbers of hospitalisations due to respiratory disease computed using indirect standardisation from Scotland-wide respiratory hospitalisation rates.
  • pm10: Average particulate matter (less than 10 microns) concentrations.
  • jsa: The percentage of working age people who are in receipt of Job Seekers Allowance
  • price: Average property price (divided by 100,000).

The GGHB.IZ data is a Simple Features (sf) object containing the spatial polygon information for the set of 271 Intermediate Zones (IZ), that make up of the Greater Glasgow and Clyde health board in Scotland ( Figure 1 ).

Figure 1: Greater Glasgow and Clyde health board represented by 271 Intermediate Zones

Let’s start by loading useful libraries:

library(sf)
library(ggplot2)
library(scico)

The sf package allows us to work with vector data which is used to represent points, lines, and polygons. It can also be used to read vector data stored as a shapefiles.

First, lets combine both data sets based on the Intermediate Zones (IZ) variable using the merge function from base R:

resp_cases <- merge(GGHB.IZ, pollutionhealthdata, by = "IZ")

In epidemiology, disease risk is usually estimated using Standardized Mortality Ratios (SMR). The SMR for a given spatial areal unit \(i\) is defined as the ratio between the observed ( \(Y_i\) ) and expected ( \(E_i\) ) number of cases:

\[ SMR_i = \dfrac{Y_i}{E_i} \]

A value \(SMR > 1\) indicates that there are more observed cases than expected which corresponds to a high risk area. On the other hand, if \(SMR<1\) then there are fewer observed cases than expected, suggesting a low risk area.

We can manipulate sf objects the same way we manipulate standard data frame objects via the dplyr package. Lets use the pipeline command %>% and the mutate function to calculate the yearly SMR values for each IZ:

library(dplyr)
resp_cases <- resp_cases %>% 
  mutate(SMR = observed/expected, .by = year )

Now we use ggplot to visualize our data by adding a geom_sf layer and coloring it according to our variable of interest (i.e., SMR). We can further use facet_wrap to create a layer per year and chose an appropriate color palette using the scale_fill_scico from the scico package:

ggplot()+
  geom_sf(data=resp_cases,aes(fill=SMR))+
  facet_wrap(~year)+scale_fill_scico(palette = "roma")

Task

Produce a map that shows the spatial distribution of each of the following variables for the year 2011:

  • Average particulate matter pm10

  • Average property price price

  • Percentage of working age people who are in receipt of Job Seekers Allowance jsa

You can use the filter function from dplyr to subset the data according to the year of interest.

# Library for plotting multiple maps together

library(patchwork)

# subset data set for 2011

resp_cases_2011 <- resp_cases %>% filter(year ==2011)

# pm10 plot

pm10_plot <- ggplot()+
  geom_sf(data=resp_cases_2011,aes(fill=pm10))+
  scale_fill_scico(palette = "navia")

# property price

price_plot <- ggplot()+
  geom_sf(data=resp_cases_2011,aes(fill=price))+
  facet_wrap(~year)+scale_fill_scico(palette = "bilbao")

#  percentage jsa

jsa_plot <- ggplot()+
  geom_sf(data=resp_cases_2011,aes(fill=jsa))+
  facet_wrap(~year)+scale_fill_scico(palette = "lapaz") 

# plot maps together

pm10_plot + price_plot + jsa_plot + plot_layout(ncol=3)

As with the other types of spatial modelling, our goal is to observe and explain spatial variation in our data. Generally, we are aiming to produce a smoothed map which summarises the spatial patterns we observe in our data.

A key aspect of any spatial analysis is that observations closer together in space are likely to have more in common than those further apart. This can lead us towards approaches similar to those used in time series, where we consider the spatial closeness of our regions in terms of a neighbourhood structure.

The function poly2nb() of the spdep package can be used to construct a list of neighbors based on areas with contiguous boundaries (e.g., using Queen contiguity).

library(spdep)

W.nb <- poly2nb(GGHB.IZ,queen = TRUE)
W.nb
Neighbour list object:
Number of regions: 271 
Number of nonzero links: 1424 
Percentage nonzero weights: 1.938971 
Average number of links: 5.254613 
2 disjoint connected subgraphs

The warning tell us that the neighbourhood is comprised of two interconnected regions. By looking at the neighbourhood graph below, we can see that these are the North and South Glasgow regions which are separated by the River Clyde.

plot(st_geometry(GGHB.IZ), border = "lightgray")
plot.nb(W.nb, st_geometry(GGHB.IZ), add = TRUE)

You could use the snap argument within poly2nb to set a distance at which the different regions centroids are consider neighbours. To do so we first need to be aware about the spatial units of the spatial coordinate reference system (CRS). We can check this as follows:

st_crs(GGHB.IZ)$units
[1] "m"

Then, we could set a distance of 250m to join the IZ centroids that are are less than 250m apart.

W.nb250 <- poly2nb(GGHB.IZ,snap=250)
W.nb250
Neighbour list object:
Number of regions: 271 
Number of nonzero links: 1758 
Percentage nonzero weights: 2.393758 
Average number of links: 6.487085 
plot(st_geometry(GGHB.IZ), border = "lightgray")
plot.nb(W.nb250, st_geometry(GGHB.IZ), add = TRUE)

Once we have identified a set of neighbours using our chosen method, we can use this to account for correlation.

Moran’s \(I\) is a measure of global spatial autocorrelation, and can be considered an extension of the Pearson correlation coefficient. For a set of data \(Z_1, \ldots, Z_m\) measured on regions \(B_1, \ldots B_m\), with neighbourhood matrix \(W\), we can compute Moran’s I as:

\[ I = \frac{m}{\sum_{i=1}^m \sum_{j=1}^m w_{ij}}\frac{\sum_{i=1}^m \sum_{j=1}^m w_{ij} (Z_i - \bar{Z})(Z_j - \bar{Z})}{\sum_{i=1}^m (Z_i - \bar{Z})^2} \]

This is basically a function of differences in values between neighbouring areas. By far the most common approach is to use a binary neighbourhood matrix, \(W\), denoted by

\[ w_{ij} = \begin{cases} 1 & \text{if areas } (B_i, B_j) \text{ are neighbours.}\\ 0 & \text{otherwise.} \end{cases} \]

Binary matrices are used for their simplicity. Fitting spatial models often requires us to invert \(W\), and this is less computationally intensive for sparse matrices.

Moran’s \(I\) ranges between -1 and 1, and can be interpreted in a similar way to a standard correlation coefficient.

  • \(I=1\) implies that we have perfect spatial correlation.

  • \(I=0\) implies that we have complete spatial randomness.

  • \(I=-1\) implies that we have perfect dispersion (negative correlation).

Our observed \(I\) is a point estimate, and we may also wish to assess whether it is significantly different from zero. We can test for a statistically significant spatial correlation using a permutation test, with hypotheses:

\[ \begin{aligned} H_0&: \text{ negative or no spatial association } (I \leq 0)\\ H_1&: \text{ positive spatial association } (I > 0) \end{aligned} \]

We can use moran.test() to test this hypothesis by setting alternative = "greater". To do so, we need to supply list containing the neighbors via the nb2listw() function from the spdep package. Lets assess now the spatial autocorrelation of the SMR in 2011:

# subset the data
resp_cases_2011 <- resp_cases %>% filter(year ==2011)

# neighbors list 
nbw <- nb2listw(W.nb, style = "W")

# Global Moran's I
gmoran <- moran.test(resp_cases_2011$SMR, nbw,
                     alternative = "greater")
gmoran

    Moran I test under randomisation

data:  resp_cases_2011$SMR  
weights: nbw    

Moran I statistic standard deviate = 11.42, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.439899780      -0.003703704       0.001508809 
Question

What do we conclude from the Moran’s I test?

Since have set the alternative hypothesis to be $ I > 0 $ and have a p-value \(<0.05\), we then reject the null hypothesis and conclude there is evidence for positive spatial autocorrelation.

A local version of Moran’s I can also be used to measure the similarity between each IZ via the localmoran function:

lmoran <- localmoran(resp_cases_2011$SMR, nbw, alternative = "two.sided")

In this case we set alternative = "two.sided" to test whether there is any evidence of spatial autocorrelation in our data:

\[ \begin{aligned} H_0&: \text{ no spatial association } (I=0)\\ H_1&: \text{ some spatial association } (I \neq 0) \end{aligned} \]

We can obtain the \(Z\)-scores from the test (Z.Ii) where any values smaller than –1.96 indicate negative spatial autocorrelation, and z-score values greater than 1.96 indicate positive spatial autocorrelation:

resp_cases_2011_m <- resp_cases_2011 %>% mutate(Zscores = lmoran[,"Z.Ii"])

resp_cases_2011_m <- resp_cases_2011_m %>% 
  mutate (SAC = case_when(
  Zscores > qnorm(0.975) ~ " M > 1" ,
  Zscores < -1*qnorm(0.975) ~ " M < 1",
  .default = "M = 0"),
  SAC=as.factor(SAC)
)

We can visualize these results using either ggplot as we did before, or via the mapview library which contains interactive features:

library(mapview)
mapview(resp_cases_2011_m, 
        zcol = "SAC",
        layer.name = "SAC",
        col.regions=c("turquoise","orange","grey40"))